I selected the Ford GoBike dataset to investigate how different people use the bike hiring service. It contains many variables, and my main focus is to identify the ones that most affect the bike hiring program in SF.
The data consists of 16 variables, such as age, gender, weekday, and time, and contains 3.31 billion rides. Users aged 18 to 56 make up 95% of the dataset. Some users were recorded as more than 100 years old, so users older than 60 can be removed. I also generated new fields, such as age group, in order to analyze the data by group. Ford GoBike expanded its service to San Francisco, Oakland, and San Jose. However, traffic across all three cities is hard to picture at once, so I decided to focus on the San Francisco area.
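The age filtering and age grouping described above are not shown in the cells that follow, so here is a minimal sketch of how they could be done. The column name `member_birth_year` comes from the dataset; `reference_year`, `max_age`, the bin edges, and the toy rows are assumptions for illustration only.

```python
import pandas as pd

# Toy rows standing in for the real rides (assumed values).
rides = pd.DataFrame({"member_birth_year": [1900, 1985, 1995, 1960, 2000]})

reference_year = 2019  # assumption: year used to turn birth year into age
max_age = 60           # cutoff mentioned in the text above

# Derive age, then drop implausibly old users.
rides["age"] = reference_year - rides["member_birth_year"]
rides = rides[rides["age"] <= max_age].copy()

# Bucket ages into groups; the bin edges are illustrative, not the author's.
bins = [17, 25, 35, 45, 60]
labels = ["18-25", "26-35", "36-45", "46-60"]
rides["age_group"] = pd.cut(rides["age"], bins=bins, labels=labels)

print(rides)
```

`pd.cut` uses right-closed intervals by default, so an age of 25 falls in "18-25" and an age of 26 in "26-35".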
# import all packages and set plots to be embedded inline
import os
import time
import glob
import numpy as np
import pandas as pd
import helpers as hp
import plotly.express as px
import plotly.graph_objects as go
# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")
# Read all CSV files and concatenate them into one dataframe
path = r"D:\GitLabRepos\GoBike\gobike"
all_files = glob.glob(os.path.join(path, "*.csv"))
print("Concatenating files into one dataframe...")
start_time = time.time()
df = pd.concat(
    (
        pd.read_csv(
            file,
            parse_dates=['start_time', 'end_time', 'member_birth_year'],
            dtype={"start_station_id": "O", "end_station_id": "O", "bike_id": "O"},
            nrows=10000,  # only the first 10,000 rows of each file are read
        )
        for file in all_files
    ),
    ignore_index=True,
)
# Data cleaning: drop the station coordinate columns, then drop rows with missing values
df.drop(['start_station_latitude','start_station_longitude', 'end_station_latitude', 'end_station_longitude'], axis=1, inplace=True)
df.dropna(inplace=True)
# Add new columns that help answer the questions precisely:
# trip duration in minutes and its base-10 logarithm
df["duration_min"] = df["duration_sec"] / 60
df['duration_min_log'] = np.log10(df['duration_min'])
# Month name, day name, and hour of each trip's start time
df['start_month'], df['start_day'], df['start_hour'] = (
    df['start_time'].dt.month_name(),
    df['start_time'].dt.day_name(),
    df['start_time'].dt.hour,
)
# Season label for each row, computed by the local helpers module
df["season"] = df.apply(hp.seasons, axis=1)
end_time = time.time()
print("done!")
print("It took {} seconds to read, concatenate and wrangle the datasets".format(round(end_time - start_time, 2)))
df.head()
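The `season` column above comes from `hp.seasons`, a function in a local `helpers` module that is not shown in this notebook. As a rough illustration only, a month-to-season lookup could look like the sketch below. The name `month_to_season` and the Northern Hemisphere meteorological mapping are assumptions, and the real helper may assign months differently.

```python
import pandas as pd

# Assumed mapping (meteorological seasons, Northern Hemisphere).
SEASON_BY_MONTH = {
    "December": "Winter", "January": "Winter", "February": "Winter",
    "March": "Spring", "April": "Spring", "May": "Spring",
    "June": "Summer", "July": "Summer", "August": "Summer",
    "September": "Autumn", "October": "Autumn", "November": "Autumn",
}

def month_to_season(month_name: str) -> str:
    """Hypothetical stand-in for hp.seasons: map a month name to a season."""
    return SEASON_BY_MONTH[month_name]

# Toy frame with the same start_month format produced by dt.month_name().
toy = pd.DataFrame({"start_month": ["January", "April", "August", "October"]})
toy["season"] = toy["start_month"].map(month_to_season)
print(toy)
```

Mapping a single column with `Series.map` avoids the row-by-row `df.apply(..., axis=1)` call, which tends to be slow on large frames.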
# Plot the distribution of trip duration (log10 of minutes).
data = go.Histogram(x=df["duration_min_log"])
layout = go.Layout(
    title="Distribution of trip duration after log transformation",
    xaxis={"showgrid": False, "title": "log10(duration in min)"},
    yaxis={"showgrid": False, "title": "Frequency"}
)
fig = go.Figure(data, layout)
fig.show()
# presenting the 5 number summary
data = go.Box(y=df["duration_min_log"], name="Trip Duration")
layout = go.Layout(
    title="Distribution of trip duration after log transformation",
    xaxis={"showgrid": False},
    yaxis={"showgrid": False}
)
fig = go.Figure(data, layout)
fig.show()
Conclusion 1: The raw trip durations in seconds are difficult to read on a plot, so I applied a base-10 log transformation, which gives a roughly normal shape and makes the question easier to answer. Most trips take about 10 minutes on average, so they are short trips.
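To read the log-scaled axis, it helps to remember that a value of 1 on the log10 scale corresponds to 10 minutes. A small sketch with made-up durations shows the transformation and how to convert back:

```python
import numpy as np

# Made-up durations in minutes, chosen as powers of ten for readability.
duration_min = np.array([1.0, 10.0, 100.0])

# Same transformation as df['duration_min_log'] in the notebook.
log_duration = np.log10(duration_min)

# Raising 10 to the log value recovers the original minutes.
recovered_min = 10 ** log_duration

print(log_duration)
print(recovered_min)
```

So a histogram peak near 1 on the log axis means trips of roughly 10 minutes.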
# Median trip duration per season
season_duration_median = df.groupby('season')['duration_min'].median().reset_index()
fig = go.Figure(
    go.Bar(
        x=season_duration_median['season'].tolist(),
        y=season_duration_median['duration_min'].tolist(),
        text=round(season_duration_median['duration_min'], 2).astype(str).tolist(),
        textposition="auto"
    ),
    go.Layout(
        title="Median trip duration per season in minutes",
        xaxis={"showgrid": False, "title": "Season"},
        yaxis={"showgrid": False, "title": "Median trip duration (min)"}
    )
)
fig.show()
Conclusion 2: Because the data contains many outliers, I measured the typical trip duration with the median rather than the mean so that the results are not misleading. Although there is no significant difference in trip duration across seasons, the plot shows that spring has the longest median trip duration. I expected this, since spring brings relaxing weather that encourages cycling. Weather does not seem to affect trips much in SF. I don't know why, but it may be because the weather there rarely swings to extremes.
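The choice of median over mean can be illustrated with a few made-up durations, where a single very long rental (a bike kept for a full day) dominates the mean but barely moves the median:

```python
import numpy as np

# Made-up trip durations in minutes; the last value is an outlier.
durations_min = np.array([8, 9, 10, 11, 12, 1440])

mean_duration = durations_min.mean()
median_duration = np.median(durations_min)

print(f"mean:   {mean_duration:.2f} min")
print(f"median: {median_duration:.2f} min")
```

Here the mean is about 248 minutes while the median stays at 10.5 minutes, which is much closer to what a typical rider does.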
# Heatmap of log10 trip duration by month (x) and season (y)
fig = go.Figure(
    go.Heatmap(
        z=df['duration_min_log'].tolist(),
        x=df['start_month'].tolist(),
        y=df['season'].tolist(),
        colorbar={"title": "log10(duration in min)"},
        # hoverongaps = False
    ),
    go.Layout(
        title="Relationship between trip duration and months across year",
        xaxis={"showgrid": False, "title": "Months"},
        yaxis={"showgrid": False, "title": "Season"},
        xaxis_type="category"
    )
)
fig.show()
# Heatmap of log10 trip duration by day of week (x) and season (y)
fig = go.Figure(
    go.Heatmap(
        z=df['duration_min_log'].tolist(),
        x=df['start_day'].tolist(),
        y=df['season'].tolist(),
        colorbar={"title": "log10(duration in min)"},
        # hoverongaps = False
    ),
    go.Layout(
        title="Relationship between trip duration and days across week",
        xaxis={"showgrid": False, "title": "Day of week"},
        yaxis={"showgrid": False, "title": "Season"},
        xaxis_type="category"
    )
)
fig.show()
Conclusion 3: I created the season column to explore season, month, and trip duration together. The first heatmap above shows that the longest trip durations are in summer, specifically in August. Winter, in Sep and Jan, comes second, while the spring months have the shortest trip durations of the year. Unlike the previous bar chart, which shows spring with the longest median trip duration, the heatmap suggests that summer has the most long trips. The second heatmap supports the same finding: summer has the peak trip durations, with the highest frequency on Wednesdays.
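One way to cross-check the heatmaps against the median bar chart is to aggregate the durations into a season-by-month grid first, so that each cell holds a single summary value. The sketch below does this on made-up rows; the column names match the notebook, while the values and the choice of the median as the aggregate are assumptions.

```python
import pandas as pd

# Made-up trips using the notebook's column names.
toy = pd.DataFrame({
    "season": ["Summer", "Summer", "Summer", "Winter", "Winter"],
    "start_month": ["August", "August", "July", "January", "January"],
    "duration_min": [10.0, 20.0, 12.0, 8.0, 9.0],
})

# One cell per (season, month) pair holding the median duration;
# combinations with no trips are left as NaN.
median_grid = toy.pivot_table(
    index="season",
    columns="start_month",
    values="duration_min",
    aggfunc="median",
)

print(median_grid)
```

A grid like `median_grid` could then be passed to `go.Heatmap` through its `z`, `x`, and `y` arguments, so that each cell's color reflects an explicit aggregate instead of individual trips.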